This is the notebook developed for the course of Data Science for Economics and Finance by Dario Cagetti (0000918366), Giuseppe Difazio (0000918394), Manuel Rech (0000914421) and Michele Vagnetti (0000918419).
Welcome to the year 2912. We’ve received a transmission from four lightyears away and things aren’t looking good. The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension! In this competition, your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic’s collision with the spacetime anomaly. To help you make these predictions, you’re given a set of personal records recovered from the ship’s damaged computer system.
The dataset we have decided to use is derived from a Kaggle competition.
train_df$HomePlanet = as.factor(train_df$HomePlanet)
train_df$CryoSleep = as.factor(train_df$CryoSleep)
train_df$Destination = as.factor(train_df$Destination)
train_df$VIP = as.factor(train_df$VIP)
train_df$Name = as.factor(train_df$Name)
train_df$Transported = as.factor(train_df$Transported)
Let's start by getting an idea of some dependencies in our
dataset. We start out with a mosaic plot using the predictors
HomePlanet and CryoSleep.
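The plotting code is not shown in the output; a minimal sketch with base R's mosaicplot (shading by Pearson residuals) could look like this:

```r
# Mosaic plot of CryoSleep against Transported; shade = TRUE colors the
# tiles by Pearson residuals under independence (blue = more observations
# than expected, red = fewer)
mosaicplot(table(train_df$CryoSleep, train_df$Transported),
           shade = TRUE,
           xlab = "CryoSleep", ylab = "Transported",
           main = "CryoSleep vs Transported")
```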
This chart shows the number of transported and non-transported passengers as a function of the various predictors. Take the chart of CryoSleep against Transported: it can give us intuition in two ways:
looking at the box sizes -> we see that if CryoSleep was
True, many more passengers were transported than not
transported. On the contrary, if it was False, many more
people were not transported than transported.
looking at the chart colors -> transparent blocks indicate
independence, while blue states that 'there are
more occurrences in this class than in the independent case' and
red that 'there are fewer occurrences in this class than in the
independent case'. Therefore, the more intense the color of the
boxes, the stronger the dependence between that variable and the
outcome Transported.
The same intuition can also be developed with the other categorical variables.
## [1] TRAPPIST-1e PSO J318.5-22 55 Cancri e <NA>
## Levels: 55 Cancri e PSO J318.5-22 TRAPPIST-1e
To get some intuition on what the raw variables look like, here is the summary of the numeric predictors.
## Age RoomService FoodCourt ShoppingMall
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:19.00 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median :27.00 Median : 0.0 Median : 0.0 Median : 0.0
## Mean :28.83 Mean : 224.7 Mean : 458.1 Mean : 173.7
## 3rd Qu.:38.00 3rd Qu.: 47.0 3rd Qu.: 76.0 3rd Qu.: 27.0
## Max. :79.00 Max. :14327.0 Max. :29813.0 Max. :23492.0
## NA's :179 NA's :181 NA's :183 NA's :208
## Spa VRDeck
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.0 Median : 0.0
## Mean : 311.1 Mean : 304.9
## 3rd Qu.: 59.0 3rd Qu.: 46.0
## Max. :22408.0 Max. :24133.0
## NA's :183 NA's :188
By looking at this graph we can notice that people older than 30 were transported and not transported with pretty much the same distribution. Below that age we can highlight how people in their twenties seem to have been sacrificed, probably to help younger and older passengers. Indeed, we can also see how different the distribution is for those younger than 10-15: many more people of this age were transported.
In both violin plots we can see almost the same behaviour: not many differences between non-transported and transported, with the orange observations being a bit thicker above about 50. This means that few passengers spent a lot on these extra services, and also that spending more did not ensure a preferential treatment when the accident occurred.
Here we can notice a few things. First, among all the people that spent less than 30 on the VRDeck services, the majority was not transported. Also here we can see thicker orange values above 50, meaning that few passengers spent a lot on this service and that, again, spending more did not mean a better treatment in case of casualty.
As in the previous case, we can observe a similar distribution in both plots, yet with a different pattern: passengers that spent very little were hardly ever transported, while people that spent a moderate sum on these services had a higher probability of being transported. For the third time we can spot thicker orange values past most of the observations, meaning, again, that few passengers spent a lot on these services and that spending more did not mean a better treatment in case of casualty.
To start out, let's see what the initial dataset looks like.
head(train_df)
Looking at the information reported about the variables, we can
notice that two columns, PassengerId and Cabin,
hide some information that may be exploited by creating multiple
columns.
Firstly we split PassengerId, built up as 'gggg_pp':
head(train_df$PassengerId)
## [1] "0001_01" "0002_01" "0003_01" "0003_02" "0004_01" "0005_01"
group is the 'gggg' part, and refers to the
group the passenger belongs to; think of a family as a group
number_within_group is the 'pp' part and refers
to the passenger's number within the group
Secondly, we split the Cabin column, built as
‘deck/num/side’,
head(train_df$Cabin)
## [1] "B/0/P" "F/0/S" "A/0/S" "A/0/S" "F/1/S" "F/0/P"
side can be either P for Port or S for
Starboard
deck refers to the position on the ship
num is a further categorization of the cabin
location
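The splitting itself can be sketched in base R (the original chunk is not shown; column names follow the text):

```r
# PassengerId is "gggg_pp": group id and number within the group
id_parts <- strsplit(as.character(train_df$PassengerId), "_", fixed = TRUE)
train_df$group <- sapply(id_parts, `[`, 1)
train_df$number_within_group <- sapply(id_parts, `[`, 2)

# Cabin is "deck/num/side": three separate pieces of information
cabin_parts <- strsplit(as.character(train_df$Cabin), "/", fixed = TRUE)
train_df$deck <- sapply(cabin_parts, `[`, 1)
train_df$num  <- sapply(cabin_parts, `[`, 2)
train_df$side <- sapply(cabin_parts, `[`, 3)
```

Missing Cabin or PassengerId values simply propagate as NA through the split.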
Let's graph the mosaic plot also for these three newly created variables.
After pre-processing, the datasets are loaded into the R environment, and since the categorical variables are just lists of characters, it is appropriate to convert them into factors. However, some factors have too many levels, sometimes almost as many as there are rows.
As intuition suggests, the variable Name may be
irrelevant for predicting whether a passenger was transported: it
has 8473 levels on a dataset with 8693 observations. For this reason we
drop it.
A similar reasoning applies to the variable Num, with
1817 levels, derived from the old variable Cabin (that was
decomposed as Deck/Num/Side).
Notice that Deck has 8 levels and Side 2
instead.
The variable Group, with 6217 levels, is also dropped.
It was derived from the PassengerId variable, which was
divided into Group and number_within_group,
the latter with 8 levels.
After this preliminary manual selection of the predictors, we want to count and handle the missing values appropriately. This is how many NA values we have per variable:
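The counts below can be produced with a one-liner per dataframe (a sketch; the original code is not shown):

```r
# Number of missing values per column, train and test
print("train dataframe missing values")
colSums(is.na(train_df))
print("test dataframe missing values")
colSums(is.na(test_df))
```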
## [1] "train dataframe missing values"
## HomePlanet CryoSleep Destination Age
## 201 217 182 179
## VIP RoomService FoodCourt ShoppingMall
## 203 181 183 208
## Spa VRDeck Transported deck
## 183 188 0 199
## side number_within_group
## 199 0
## [1] "test dataframe missing values"
## HomePlanet CryoSleep Destination Age
## 87 93 92 91
## VIP RoomService FoodCourt ShoppingMall
## 93 82 106 98
## Spa VRDeck deck side
## 101 80 100 100
## number_within_group
## 0
Now, using the library ImputeMissing we substitute the
missing values using the following criteria:
median for numerical variables
mode for categorical variables
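In base R the same median/mode criteria can be sketched as follows (this mirrors what the imputation library does; it is not its actual API):

```r
# Impute NAs: median for numeric columns, most frequent level otherwise
impute_simple <- function(df) {
  for (col in names(df)) {
    if (!anyNA(df[[col]])) next
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    } else {
      # Mode: the level with the highest frequency
      mode_level <- names(which.max(table(df[[col]])))
      df[[col]][is.na(df[[col]])] <- mode_level
    }
  }
  df
}

train_df <- impute_simple(train_df)
test_df  <- impute_simple(test_df)
```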
## [1] "train dataframe missing values after imputation"
## HomePlanet CryoSleep Destination Age
## 0 0 0 0
## VIP RoomService FoodCourt ShoppingMall
## 0 0 0 0
## Spa VRDeck Transported deck
## 0 0 0 0
## side number_within_group
## 0 0
## [1] "test dataframe values after imputation"
## HomePlanet CryoSleep Destination Age
## 0 0 0 0
## VIP RoomService FoodCourt ShoppingMall
## 0 0 0 0
## Spa VRDeck deck side
## 0 0 0 0
## number_within_group
## 0
## Age RoomService FoodCourt ShoppingMall
## Min. : 0.00 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.:20.00 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0
## Median :27.00 Median : 0 Median : 0.0 Median : 0.0
## Mean :28.79 Mean : 220 Mean : 448.4 Mean : 169.6
## 3rd Qu.:37.00 3rd Qu.: 41 3rd Qu.: 61.0 3rd Qu.: 22.0
## Max. :79.00 Max. :14327 Max. :29813.0 Max. :23492.0
## Spa VRDeck
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.0 Median : 0.0
## Mean : 304.6 Mean : 298.3
## 3rd Qu.: 53.0 3rd Qu.: 40.0
## Max. :22408.0 Max. :24133.0
Here we can see the numeric predictors for this dataset: there are 5 predictors regarding expenses incurred by passengers on the spaceship, and 1 predictor indicating the age of the passengers. As neither the range of values these predictors take nor their mean is the same, we proceed with scaling these predictors to reach mean 0 and standard deviation 1.
train_df[ , numeric_cols_train] = scale(train_df[, numeric_cols_train])
test_df[ , numeric_cols_test] = scale(test_df[, numeric_cols_test])
summary(train_df[, numeric_cols_train])
## Age RoomService FoodCourt ShoppingMall
## Min. :-2.0075 Min. :-0.3331 Min. :-0.2810 Min. :-0.2836
## 1st Qu.:-0.6129 1st Qu.:-0.3331 1st Qu.:-0.2810 1st Qu.:-0.2836
## Median :-0.1248 Median :-0.3331 Median :-0.2810 Median :-0.2836
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5724 3rd Qu.:-0.2710 3rd Qu.:-0.2428 3rd Qu.:-0.2468
## Max. : 3.5010 Max. :21.3574 Max. :18.4013 Max. :39.0003
## Spa VRDeck
## Min. :-0.2706 Min. :-0.2630
## 1st Qu.:-0.2706 1st Qu.:-0.2630
## Median :-0.2706 Median :-0.2630
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2235 3rd Qu.:-0.2277
## Max. :19.6377 Max. :21.0159
Now that the dataset has been processed, we are ready to try some machine learning models to understand which is the best one for making predictions on unknown data.
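The train/validation split itself is not shown; a sketch using caTools::sample.split (the same helper used later for the regularized models) with a 60/40 stratified split might look like this (the level name "True" is assumed):

```r
library(caTools)

set.seed(1)
# Stratified split: preserves the proportion of Transported in both parts
in_train <- sample.split(train_df$Transported, SplitRatio = 3/5)
train_set <- train_df[in_train, ]
validation_set <- train_df[!in_train, ]

cat("The mean of transported in train_set is",
    mean(train_set$Transported == "True"),
    "while the mean of transported in validation_set is",
    mean(validation_set$Transported == "True"))
```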
## The mean of transported in train_set is 0.5036426 while the mean of transported in validation_set is 0.5035951 , so the division seems to be done evenly
head(train_df)
Now, we know we have a classification problem, not a regression one, so using methods such as FSS and BSS is not the best choice; however, this can give a sense of which variables may be more relevant. As we'll see, some of the findings here are in line with predictions made by other classifiers.
We use the library leaps for this simple
regression, and our choice of selection method falls on FSS. This
is because Forward, Backward and Best Subset Selection yield very
similar results, and therefore we choose the method that is less
computationally expensive.
As we know from theory, when we fit a linear regression model we minimize the Residual Sum of Squares of the training set and consequently the Mean Squared Error (MSE = RSS/n). However, since we care about model accuracy, that is, the ability of the regression to fit unseen data, we do not pay too much attention to the training MSE; we prefer the test MSE instead. The former usually underestimates the latter (training MSE is monotonically decreasing in flexibility while test MSE is U-shaped), and therefore we use different approaches to estimate the test MSE: Mallow's Cp, the Bayesian Information Criterion and the adjusted R2 coefficient.
We are not going into the details of these methods; it suffices to say that, after performing the same steps with the different criteria, we ended up with very similar results, so we only show here the results for the Cp criterion.
## model estimation
forward_subset_selection = regsubsets(Transported ~ ., data = train_set, nvmax = 27, method = "forward")
However, as we have a classification task, it is more proper to use a
logistic regression instead of a linear one. Furthermore, we
also perform some subset selection using the function step,
which by default does a backward step-wise selection.
set.seed(1)
logistic = glm(Transported ~ ., data=train_set, family = binomial())
bestlogit <- step(logistic, trace = FALSE)
bestlogit$anova
Calling anova on the bestlogit model shows that the
predictors 'VIP' and 'number within group' are not relevant for the
classification and are therefore omitted from the optimal estimated
model.
summary(bestlogit)
##
## Call:
## glm(formula = Transported ~ HomePlanet + CryoSleep + Destination +
## Age + RoomService + FoodCourt + ShoppingMall + Spa + VRDeck +
## deck + side, family = binomial(), data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9864 -0.6692 0.0372 0.6982 3.2356
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.18367 0.38244 -5.710 1.13e-08 ***
## HomePlanetEuropa 1.61778 0.27410 5.902 3.59e-09 ***
## HomePlanetMars 0.49322 0.11900 4.145 3.40e-05 ***
## CryoSleepTrue 1.42710 0.10110 14.116 < 2e-16 ***
## DestinationPSO J318.5-22 -0.30240 0.14440 -2.094 0.036245 *
## DestinationTRAPPIST-1e -0.33654 0.10337 -3.256 0.001131 **
## Age -0.13225 0.03834 -3.449 0.000562 ***
## RoomService -1.00264 0.07976 -12.571 < 2e-16 ***
## FoodCourt 0.82014 0.08312 9.866 < 2e-16 ***
## ShoppingMall 0.33845 0.04983 6.792 1.10e-11 ***
## Spa -2.35344 0.15630 -15.058 < 2e-16 ***
## VRDeck -2.17879 0.15515 -14.043 < 2e-16 ***
## deckB 1.10984 0.33127 3.350 0.000807 ***
## deckC 2.45846 0.37218 6.606 3.96e-11 ***
## deckD 0.82780 0.36680 2.257 0.024020 *
## deckE 0.09739 0.37001 0.263 0.792391
## deckF 0.88112 0.37057 2.378 0.017419 *
## deckG 0.46343 0.38094 1.217 0.223776
## deckT -0.16451 1.92734 -0.085 0.931977
## sideS 0.58435 0.07464 7.829 4.90e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7230.6 on 5215 degrees of freedom
## Residual deviance: 4471.1 on 5196 degrees of freedom
## AIC: 4511.1
##
## Number of Fisher Scoring iterations: 7
Also, as we can see from this summary, the deck predictor is encoded with one dummy per level minus one, to avoid collinearity. However, the estimates for the levels G, T and E are not very significant. To exploit all the predictive power of the model, let's make an attempt to join these three levels of the factor 'deck' and re-estimate the model.
The level EGT now comprises the three levels E, G and T.
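The level merging can be sketched like this (the dataset name train_set_logistic follows the next chunk; assigning the same name to several factor levels collapses them):

```r
# Collapse the non-significant deck levels E, G and T into one level "EGT"
train_set_logistic <- train_set
levels(train_set_logistic$deck)[
  levels(train_set_logistic$deck) %in% c("E", "G", "T")] <- "EGT"
```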
set.seed(1)
logistic2 = glm(Transported ~ ., data=train_set_logistic, family = binomial())
bestlogit2 <- step(logistic2, trace = FALSE)
summary(bestlogit2)
##
## Call:
## glm(formula = Transported ~ HomePlanet + CryoSleep + Destination +
## Age + RoomService + FoodCourt + ShoppingMall + Spa + VRDeck +
## deck + side, family = binomial(), data = train_set_logistic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9181 -0.6679 0.0371 0.6920 3.1951
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.28904 0.37543 -6.097 1.08e-09 ***
## HomePlanetEuropa 1.76549 0.26640 6.627 3.42e-11 ***
## HomePlanetMars 0.71184 0.10511 6.772 1.27e-11 ***
## CryoSleepTrue 1.28408 0.09386 13.680 < 2e-16 ***
## DestinationPSO J318.5-22 -0.32554 0.14482 -2.248 0.024589 *
## DestinationTRAPPIST-1e -0.35353 0.10318 -3.427 0.000611 ***
## Age -0.11220 0.03783 -2.966 0.003019 **
## RoomService -1.00109 0.07966 -12.567 < 2e-16 ***
## FoodCourt 0.81332 0.08294 9.806 < 2e-16 ***
## ShoppingMall 0.34373 0.05010 6.861 6.82e-12 ***
## Spa -2.33047 0.15505 -15.031 < 2e-16 ***
## VRDeck -2.15059 0.15382 -13.981 < 2e-16 ***
## deckB 1.13252 0.32652 3.468 0.000523 ***
## deckC 2.42304 0.36633 6.614 3.73e-11 ***
## deckD 0.77079 0.36028 2.139 0.032403 *
## deckF 0.14513 0.36383 0.399 0.689983
## deckEGT 0.81499 0.36401 2.239 0.025161 *
## sideS 0.58644 0.07452 7.870 3.55e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7230.6 on 5215 degrees of freedom
## Residual deviance: 4487.6 on 5198 degrees of freedom
## AIC: 4523.6
##
## Number of Fisher Scoring iterations: 7
bestlogit2$anova
Also, by calling the anova function again on the bestlogit2 model, we see that the same two predictors are eliminated; therefore we can assume that, using these criteria, we have reached an optimum for the logistic regression.
## Logistic AUC: 0.8727135
Given that subset selection generally leads to over-fitting, let us assess the performance of new models based on regularization methods, which reduce variance by adding a penalty term.
In order to fit our Ridge, Lasso and Elastic Net models we need to create four matrices, two for x-values and y-values respectively, containing the observations of the train and test sets, which have been split randomly using a stratified sampling technique.
set.seed(1)
X = model.matrix(Transported ~ . -1, data = train_df)
y = train_df$Transported
set.seed(1)
train.label = sample.split(train_df$Transported, SplitRatio = 3/5)
X.train = X[train.label, ]
y.train = y[train.label]
X.test = X[!train.label, ]
y.test = y[!train.label]
y.train = as.numeric(y.train) -1
y.test = as.numeric(y.test) - 1
Regularization methods minimize the RSS plus a penalty term, called the shrinkage penalty, which has the effect of shrinking the estimates of the model's parameters towards zero. The amount of penalty is tuned by the parameter lambda, which can take any value in the range [0, +∞): 0 represents the case where the outcome is exactly equal to an OLS regression, and +∞ the case where only the intercept remains in the model, that is, the mean value of the response.
We use the elastic net formulation of the penalty function, which weighs the lasso and ridge penalties by coefficients α and (1-α), respectively. Thus, to obtain a ridge regression we set α equal to 0.
Unlike OLS, ridge regression produces different parameter estimates depending on the value of lambda; thus, we need to cross-validate its value to choose the best model.
set.seed(1)
cv.ridge = cv.glmnet(X.train, y.train, alpha = 0, family=binomial())
# alpha = 0 is for Ridge, alpha = 1 is for Lasso
plot(cv.ridge)
coef(cv.ridge)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.591685071
## HomePlanetEarth -0.406999503
## HomePlanetEuropa 0.502547785
## HomePlanetMars 0.041909340
## CryoSleepTrue 1.506322382
## DestinationPSO J318.5-22 -0.171274572
## DestinationTRAPPIST-1e -0.221643907
## Age -0.123017919
## VIPTrue -0.224804558
## RoomService -0.557053667
## FoodCourt 0.294931913
## ShoppingMall 0.245987054
## Spa -0.747441128
## VRDeck -0.737830686
## deckB 0.489601986
## deckC 0.785854822
## deckD 0.011471137
## deckE -0.476742899
## deckF 0.081439620
## deckG -0.130125920
## deckT 0.147991266
## sideS 0.415878176
## number_within_group2 0.162559729
## number_within_group3 0.343219701
## number_within_group4 0.051401794
## number_within_group5 -0.243078487
## number_within_group6 0.004613333
## number_within_group7 0.230267703
## number_within_group8 -0.032922557
Given the bias-variance trade-off, we select the best model under the one-standard-error rule, picking not the model with the absolute lowest cross-validated error but the simplest one within one standard error of it.
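The AUC below can be obtained by predicting on the held-out rows at lambda.1se (a sketch using the pROC package; object names follow the earlier chunks):

```r
library(pROC)

# Predicted probabilities at the one-standard-error lambda
ridge_prob <- predict(cv.ridge, newx = X.test,
                      s = "lambda.1se", type = "response")
cat("Ridge:", auc(roc(y.test, as.numeric(ridge_prob))))
```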
## Ridge: 0.8690364
The Lasso regression is computed by selecting α = 1, which is the default in glmnet. Unlike Ridge, Lasso performs variable selection, meaning that the coefficient estimates of some insignificant variables are exactly zero. Thus, by providing a sparse model, Lasso improves interpretability over Ridge, even though this does not mean it is necessarily better at predicting data.
set.seed(1)
cv.lasso = cv.glmnet(X.train, y.train, family=binomial())
plot(cv.lasso)
coef(cv.lasso)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.77011976
## HomePlanetEarth -0.44771369
## HomePlanetEuropa 0.71969559
## HomePlanetMars .
## CryoSleepTrue 1.45807539
## DestinationPSO J318.5-22 -0.03475148
## DestinationTRAPPIST-1e -0.12483670
## Age -0.08323170
## VIPTrue .
## RoomService -0.76957200
## FoodCourt 0.48946519
## ShoppingMall 0.26887562
## Spa -1.47885917
## VRDeck -1.35225864
## deckB 0.29279437
## deckC 1.01757061
## deckD .
## deckE -0.53289611
## deckF .
## deckG -0.23840332
## deckT .
## sideS 0.45061428
## number_within_group2 0.10493157
## number_within_group3 0.25154842
## number_within_group4 .
## number_within_group5 -0.02388122
## number_within_group6 .
## number_within_group7 .
## number_within_group8 .
For the bias-variance trade-off we select the best model under the one-standard-error rule. As the graph shows, this not only decreases the variance of the model but also allows us to simplify it, using "just" 18 variables rather than 23, improving the overall interpretability.
## Lasso: 0.8767769
The elastic net regression is performed by assigning α a value in the range (0,1); in this case we chose the midpoint α = 0.5, even though a more thorough analysis would cross-validate the parameter α to find the best model.
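Such a cross-validation of α could be sketched as follows (an extension the text mentions but does not perform; fixing the folds via foldid makes the comparison across α values fair):

```r
set.seed(1)
# Same fold assignment for every alpha, so CV errors are comparable
folds <- sample(rep(1:10, length.out = nrow(X.train)))
alphas <- seq(0, 1, by = 0.1)

cv_err <- sapply(alphas, function(a) {
  fit <- cv.glmnet(X.train, y.train, alpha = a,
                   family = binomial(), foldid = folds)
  # CV error at the one-standard-error lambda
  fit$cvm[fit$lambda == fit$lambda.1se]
})
alphas[which.min(cv_err)]   # best alpha on this grid
```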
set.seed(1)
cv.elnet = cv.glmnet(X.train, y.train, alpha = 0.5, family = binomial())
plot(cv.elnet)
coef(cv.elnet)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.75523336
## HomePlanetEarth -0.45343649
## HomePlanetEuropa 0.65912155
## HomePlanetMars .
## CryoSleepTrue 1.51841231
## DestinationPSO J318.5-22 -0.14023416
## DestinationTRAPPIST-1e -0.20032414
## Age -0.10541044
## VIPTrue -0.02410208
## RoomService -0.77057660
## FoodCourt 0.49048361
## ShoppingMall 0.29495620
## Spa -1.38635880
## VRDeck -1.29555362
## deckB 0.44087386
## deckC 1.12262722
## deckD .
## deckE -0.53210470
## deckF 0.06090069
## deckG -0.22368901
## deckT .
## sideS 0.47994901
## number_within_group2 0.14955858
## number_within_group3 0.31832917
## number_within_group4 .
## number_within_group5 -0.17064692
## number_within_group6 .
## number_within_group7 .
## number_within_group8 .
Following the same approach, we used the one-standard-error rule. As the graph shows, elastic net has some variable-selection properties and allows us to decrease the variables in the model from 25 to 21, even though, as expected, it performs slightly worse than Lasso, which can eliminate three more variables.
## Elastic net alpha 0.5: 0.8763227
To introduce tree-based methods, we start with the simplest case: a single tree.
set.seed(1)
tree.train = tree(Transported ~ ., train_set)
plot(tree.train)
text(tree.train, pretty = 0)
tree.pred = predict(tree.train, validation_set, type = "vector")
# cv.tree.train = cv.tree(tree.train, FUN = prune.misclass)
#
# plot(cv.tree.train$size, cv.tree.train$dev,
# type = 'b', pch = 19, col = 'red',
# xlab = 'Tree size', ylab = 'missclassification')
#
# prune.train = prune.tree(tree.train, best = 8)
# plot(prune.train)
# text(prune.train, pretty = 0)
## [1] 0.8312572
However, estimates from this method will never perform as well as the Random Forest, which is an ensemble method using many such trees.
Using the library randomForest we estimate a random
forest with the default value of 500 trees and, as the
documentation says, the number of variables tried at each split equal
to the square root of the number of predictors, rounded down to an
integer: 3.7 rounded to 3. As we'll see, this already gives quite a
good result.
set.seed(1)
#sqrt(ncol(train_set))
rf = randomForest(Transported ~ ., data = train_set)
rf
##
## Call:
## randomForest(formula = Transported ~ ., data = train_set)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.69%
## Confusion matrix:
## False True class.error
## False 2003 586 0.2263422
## True 441 2186 0.1678721
Now we want to fine-tune the number of variables used in each
tree of the Random Forest, and for this purpose we could use the
function tuneRF in the library
randomForest, but we think it is not very robust as it may
lead to a misleading local optimum.
We proceed instead by estimating as many random forest models as there are predictors, changing mtry at every iteration. From the resulting graph, we pick the mtry (number of predictors tried at each split) associated with the lowest out-of-bag error and/or test error.
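The loop over mtry can be sketched as follows (variable names are assumed; the original chunk is not shown):

```r
set.seed(1)
p <- ncol(train_set) - 1                  # number of predictors
oob_err <- numeric(p)
for (m in 1:p) {
  rf_m <- randomForest(Transported ~ ., data = train_set,
                       mtry = m, ntree = 500)
  # OOB error of the full forest is in the last row of err.rate
  oob_err[m] <- rf_m$err.rate[500, "OOB"]
}
plot(1:p, oob_err, type = "b",
     xlab = "mtry", ylab = "OOB error")
```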
As we can see, mtry = 2 is the optimal value; let's now
estimate the best random forest.
set.seed(1)
rf2 = randomForest(Transported ~ ., data = train_set, mtry = 2, ntree = 500)
rf2
##
## Call:
## randomForest(formula = Transported ~ ., data = train_set, mtry = 2, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19.82%
## Confusion matrix:
## False True class.error
## False 2008 581 0.224411
## True 453 2174 0.172440
And also the chart of the most important variables using the Mean Decrease in the Gini index criterion.
This instead is the ROC curve for this Random Forest,
with an AUC of
## RandomForest (m = 3): 0.8641943
## RandomForest (m = 2): 0.8579669
Still within tree-based methods, we can use a bagging algorithm. This follows the same logic as the random forest classifier, the only difference being the number of variables used to estimate each tree in the forest: in this case it is the full number of predictors.
set.seed(1)
bagging = randomForest(Transported ~ ., data = train_set, mtry = 13, ntree = 500)
Let's now plot the ROC curves for these models and take a look at their Area Under the Curve.
## Bagging (m = 8): 0.8568759
Now let's implement a boosting algorithm. This is another ensemble of many trees that, unlike Random Forests and Bagging, learns slowly: each tree has very few terminal nodes, and each subsequent tree is built on the previous ones, so as to keep improving the predictive power in the areas where the model is not performing well.
Using the parameter cv.folds in the gbm function,
we can perform cross-validation to select the optimal number of trees
to estimate in order to have better predictive ability.
set.seed(1)
cv.boost.train = gbm(Transported ~ .,
data = train_set,
distribution = 'bernoulli',
n.trees = 10000,
shrinkage = 0.01,
interaction.depth = 2,
cv.folds = 10,
n.cores = 2)
# set.seed(1)
# cv.boost.train1 = gbm(Transported ~ .,
# data = train_set,
# distribution = 'bernoulli',
# n.trees = 10000,
# shrinkage = 0.01,
# interaction.depth = 1,
# cv.folds = 10,
# n.cores = 2)
best.iter = gbm.perf(cv.boost.train, method = "cv")
# best.iter1 = gbm.perf(cv.boost.train1, method = "cv")
legend("topright", legend = c("Train","Validation"),
pch = 19, col = c("black","Yellow"))
cat('The optimal number of trees is: ', best.iter)
## The optimal number of trees is: 5690
We now fit the boosting algorithm using this number of trees
set.seed(1)
best.boost.train = gbm(Transported ~ .,
data = train_set,
distribution = 'bernoulli',
n.trees = best.iter,
interaction.depth = 2,
shrinkage = 0.01)
# best.boost.train1 = gbm(Transported ~ .,
# data = train_set,
# distribution = 'bernoulli',
# n.trees = best.iter,
# interaction.depth = 1,
# shrinkage = 0.01)
And this is the performance of the boosted model
## [1] 0.882844
The Support Vector Machine (SVM) classifier for binary problems is a machine learning technique that finds a hyperplane in the space of the predictors, separating it into two 'parts' so that the observations belonging to one class fall on one side while all the others fall on the other side.
As with all machine learning models, we have a parameter to set in
order to constrain flexibility and obtain the minimum possible test
error. In this case we have the cost parameter, which controls the
penalty of misclassification: in other words, how many wrongly
classified observations we allow in order to avoid overfitting the
data while still not underfitting it.
set.seed(1)
# run 10-CV
# linear_svm = tune(svm, Transported ~ ., data = train_set,
# kernel = "linear",
# scale = TRUE,
# ranges = list(cost = c(0.1, 1, 5,6,7,8,9,10,11,12,13,14,15,20)))
load(file='tune_out_linear.Rdata')
tune.outL$best.model
##
## Call:
## best.tune(method = svm, train.x = Transported ~ ., data = train_set,
## ranges = list(cost = c(0.1, 1, 5, 6, 7, 8, 9, 10, 11, 12, 13,
## 14, 15, 20)), kernel = "linear", scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 6
##
## Number of Support Vectors: 2508
plot(tune.outL, main='error as a function of cost, linear svm')
As we can see from the chart and the table above, the best SVM classifier is reached with the cost parameter = 6. Using this value we can make predictions on the validation_set and plot the usual ROC curve and AUC.
## [1] 0.8615537
To account for non-linear boundaries we may try non-linear specifications of the SVM classifier.
Specifically, we have decided to use the radial kernel. In this case, SVM is no longer a global method as the linear SVC was: we are classifying unknown observations using known observations that lie nearby in the predictor space. This classifier is somewhat reminiscent of the KNN classifier, with the difference that we are creating boundaries within the dataset rather than classifying observation by observation.
set.seed(1)
# cross-validating the tuning parameters gamma and lambda
# svm_non_linear = tune(svm, Transported ~ ., data = train_set,
# kernel = "radial",
# ranges = list(cost = c(0.1,1,5,10,15,10,25,30),
# gamma = c(0.5, 1, 2, 3, 4)))
load('svm_non_linear.Rdata')
cat('Below there are the best parameters')
## Below there are the best parameters
svm_non_linear$best.parameters
plot(svm_non_linear, main = 'Optimal parameters for non-linear radial SVM')
However, we may notice that the best picks for cost and gamma are the lowest values they can possibly take, in this case cost = 0.1 and gamma = 0.5. Let's plot the ROC and calculate the AUC anyway.
## False True
## 1755 1722
## [1] 0.8518311
In order to fit the Neural Network we need to use the function
neuralnet; since it works only with quantitative
variables, we need to convert the factor variables into numerical
dummies by performing a one-hot encoding of the datasets.
Firstly, we create a dataframe containing the categorical variables;
then we apply the function dummyVars, which converts factors
into numerical dummy variables stored in a new dataset. In
order to avoid collinearity issues we delete one dummy for each
categorical variable. Finally, we add back the numerical variables
stored in the original dataset.
The one-hot encoding is performed in the following order: 1.
train_set, 2. validation_set, 3.
test_df.
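For one set, the encoding can be sketched with caret's dummyVars (fullRank = TRUE drops one dummy per factor to avoid collinearity; variable names are assumed):

```r
library(caret)

one_hot <- function(df) {
  is_fac <- sapply(df, is.factor)
  # Encode the factor columns, dropping one dummy per factor
  dv <- dummyVars(~ ., data = df[, is_fac, drop = FALSE], fullRank = TRUE)
  dummies <- as.data.frame(predict(dv, newdata = df[, is_fac, drop = FALSE]))
  # Add back the numeric columns from the original dataframe
  cbind(dummies, df[, !is_fac, drop = FALSE])
}

train_set_new <- one_hot(train_set)
```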
Using neuralnet we can fit the model using as covariates
all the variables in the dataset, as shown by the formula
Transported.True ~ . within the model. Since we are facing a
classification problem, we want to minimize ce, the
cross-entropy, which is the negative of the log-likelihood, and we
use the logistic function as activation function because the
output should take the form of probabilities, and thus must lie in
the range [0,1].
Given the huge number of tuning parameters in this kind of model and the computational cost of cross-validating to find the optimal combination, we decided to estimate just a one-hidden-layer network.
Selecting a single hidden layer with 14 nodes, we need to estimate (27 + 1) x 14 = 392 parameters for the hidden layer alone! This means there is a tangible risk of over-fitting the data, which also derives from the huge variability across different combinations of tuning parameters.
library(neuralnet)
set.seed(2020)
nn1 = neuralnet(Transported ~.,
data = train_set_new,
stepmax = 1e+7,
hidden = c(14),
algorithm = "rprop+",
err.fct = "ce",
act.fct = "logistic",
linear.output = FALSE,
lifesign = 'full',
lifesign.step = 1000,
threshold = 1,
rep = 3)
Here we can see the ROC curves and the performances for the Neural Networks
## Neural Network rep = 1: 0.8681485
## Neural Network rep = 2: 0.8591009
## Neural Network rep = 3: 0.8576715
Let's provide a graphical representation of the 1st network, the best-performing one with an AUC of 0.868, showing the resulting coefficients for each node connection. As the graph makes clear, it is extremely hard to understand the connections between the variables, and this, together with the relatively low prediction results, led us to select other simpler and less computationally expensive models.
## Rank AUC
## 1 Boosting 0.8824744
## 2 Elastic net 0.8765516
## 3 Lasso 0.8764930
## 4 Logistic regression 0.8727135
## 5 Ridge 0.8684710
## 6 Neural net 1 0.8681485
## 7 Random forest m = 3 0.8641943
## 8 Linear SVM 0.8615537
## 9 Neural net 2 0.8591009
## 10 Random forest m = 2 0.8579669
## 11 Neural net 3 0.8576715
## 12 Bagging 0.8568759
## 13 Non-linear SVM 0.8518311
Among all the methods we have used, boosting turned out to be the best performing in terms of AUC, with a value of 0.8824398, so we decided to use that model to produce the final predictions.
For the boosting model, the kaggle.com score was 0.79448.